Bachelor in Data Science and Engineering
Bayesian Data Analysis

Objective

The objective of this project is to understand how to use the EM algorithm, Gibbs sampling and Variational Bayes by applying these methods to a practical case study.

Addiction to series dataset

To create the Series.csv dataset, I made a Google Form and distributed it to people of different ages; the more varied the respondents, the more interesting the results. The questions asked on the form were the following:

  1. Have you seen any series online recently?
  2. Do you have an active subscription to any online series viewing platform?
  3. Do you share the subscription with someone else?
  4. Have you used or do you use a pirate page to watch series?
  5. Do you usually watch the series in company?
  6. Have you ever met with friends or family to watch a series?
  7. How many hours have you spent a day watching series in the last week?
  8. How many hours a week have you spent watching series in the last month?
  9. Do you usually see the series that are recommended in trends?
  10. Do you prefer series to movies?
  11. Would you rather watch the series in a theater rather than at home?
  12. Do you usually watch the series on your smartphone?
  13. Have you ever recommended a series?
  14. Have you ever been recommended a series?
  15. Do you usually stay up late watching series?
  16. Have you missed an important task to watch a series?
  17. Do you usually watch the series without interruption?
  18. Has someone close to you told you that you watch too many series?
  19. Do you consider yourself a series addict?

(The survey was originally sent in Spanish, which is why the answers in the dataset are in that language.)

Data understanding

For this section, we are going to visualize the dataset and get a better understanding of what we have.

rm(list=ls())
setwd("D:/Documentos/Universidad/Tercero/Bayesian Data Analysis/datasets")
data <- read.csv2("Series.csv",sep = ",",dec = ".")

names(data)
##  [1] "Time_Stamp"         "Age"                "Gender"            
##  [4] "x0.Recent"          "x1.Subscription"    "x2.Shared"         
##  [7] "x3.Pirate"          "x4.in.company"      "x5.meet.friends"   
## [10] "x6.day.hrs"         "x7.week.hrs"        "x8.trends"         
## [13] "x9.series.vs.films" "x10.cinema"         "x11.smartphone"    
## [16] "x12.recommend"      "x13.be.recommend"   "x14.late"          
## [19] "x15.procrastinate"  "x16.non.stop"       "x17.too.much"      
## [22] "x18.addict"

summary(data$x7.week.hrs) #Hours per week
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    8.00   14.88   19.50  100.00

summary(data$x6.day.hrs) #Hours per day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   3.255   4.000  20.000

Sensitive variables pie charts

par(mfrow = c(1,2))
pie(table(data["Age"]),main="Age",radius=1)
pie(table(data["Gender"]),main="Gender",radius=1)

Binary variables pie charts

par(mfrow = c(3,3))
for (i in 4:22){
  if(is.character(data[[i]])){
    pie(table(data[,i]),main=colnames(data)[i],radius=1)
  }
}

Analysing the age variable

aux1 = data[which(data$Age == "50-70"),]
aux2 = data[which(data$Age == "70+"),]

Y = rbind(aux1,aux2) #Dataset with ages between 50 and 100

aux1 = data[which(data$Age == "1-15"),]
aux2 = data[which(data$Age == "16-21"),]
aux3 = data[which(data$Age == "22-30"),]
aux4 = data[which(data$Age == "31-50"),]

Z = rbind(aux1,aux2,aux3,aux4) #Dataset with ages between 1 and 49
## [1] "Instances of > 50 dataset: 96"
## [1] "Instances of < 50 dataset: 106"

Summaries of the weekly hours (x7.week.hrs) for the 50+ dataset (Y) and the under-50 dataset (Z):

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    6.00   11.23   12.00  100.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.00   10.00   18.19   20.00  100.00

Summaries of the daily hours (x6.day.hrs) for the same two datasets:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   2.536   3.000  20.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   3.906   5.000  15.000

Comparing the mean weekly hours between the 50-70 and the 16-21 age groups:

print(paste0("Hours per week mean of 50-70 dataset: ",mean(data$x7.week.hrs[data$Age == "50-70"])))
print(paste0("Hours per week mean of 16-21 dataset: ",mean(data$x7.week.hrs[data$Age == "16-21"])))
## [1] "Hours per week mean of 50-70 dataset: 11.6373626373626"
## [1] "Hours per week mean of 16-21 dataset: 21.5769230769231"

Latent class analysis (LCA)

For this part, we want to obtain K groups of profiles according to their habits watching series.

For each observation, $\mathbf{x} = (x_1, \ldots, x_M)$, we may assume a $K$-component mixture of multivariate binary variables with probability distribution:

$$\Pr(\mathbf{x} \mid K, \mathbf{w}, \boldsymbol{\theta}) = \sum_{k=1}^{K} w_k \prod_{m=1}^{M} \theta_{km}^{x_m} (1 - \theta_{km})^{1 - x_m}$$

where $\mathbf{w} = \{w_k\}$ and $\boldsymbol{\theta} = \{\theta_{km}\}$ for $k = 1, \ldots, K$ and $m = 1, \ldots, M$.

We assume the following priors:

$$\mathbf{w} \sim \text{Dirichlet}(\delta_1, \ldots, \delta_K)$$

$$\theta_{km} \sim \text{Beta}(\alpha_{km}, \beta_{km})$$

In our case:

$$\Pr(x_{im} \mid \theta_{km}) = \theta_{km}^{x_{im}} (1 - \theta_{km})^{1 - x_{im}}, \quad \text{for } m = 1, \ldots, M$$

The main idea is to define a set of latent variables, $\mathbf{z} = \{z_1, \ldots, z_N\}$, indicating the group of each respondent. The prior probability that the $i$-th respondent belongs to group $k$ is:

$$\Pr(z_i = k \mid \mathbf{w}, \boldsymbol{\theta}) = w_k$$

and, given that the respondent is in group k:

$$\Pr(\mathbf{x}_i \mid z_i = k, \mathbf{w}, \boldsymbol{\theta}) = \prod_{m=1}^{M} \theta_{km}^{x_{im}} (1 - \theta_{km})^{1 - x_{im}}$$

Then, the complete-data likelihood function is:

$$\Pr(\mathbf{x}, \mathbf{z} \mid \mathbf{w}, \boldsymbol{\theta}) = \prod_{k=1}^{K} \prod_{i=1}^{N} \left[ w_k \prod_{m=1}^{M} \theta_{km}^{x_{im}} (1 - \theta_{km})^{1 - x_{im}} \right]^{I(z_i = k)}$$
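To make this generative story concrete, here is a minimal R sketch that simulates responses from such a model. All the values (K, M, w, theta) are made up for illustration; they are not estimates from our data:

```r
# Simulate N respondents from the latent class model:
# draw a class z_i with probabilities w, then M independent
# Bernoulli answers using the item probabilities of that class.
set.seed(1)
N <- 10; K <- 2; M <- 3
w <- c(0.6, 0.4)                                # class weights (sum to 1)
theta <- rbind(c(0.9, 0.8, 0.1),                # item probabilities, class 1
               c(0.2, 0.3, 0.7))                # item probabilities, class 2
z <- sample(1:K, N, replace = TRUE, prob = w)   # latent class labels
x <- t(sapply(z, function(k) rbinom(M, 1, theta[k, ])))
dim(x)   # an N x M matrix of binary answers
```

Each row of x plays the role of one respondent's 0/1 answer vector, exactly the shape of data the LCA below expects.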

Preprocessing the data to meet the model assumptions

As stated in the model assumptions, we want 19 binary variables. To achieve this, we only have to set some rules for variables x6 and x7 and ignore the Time_Stamp, Age and Gender variables. As we have seen, the Gender variable is not very informative, and although Age is important, it is not needed for the clustering itself; we will return to the age topic when commenting on the results.

X = data[4:22] #Ignore Time_Stamp, Age and Gender


X<-ifelse(X=="Si",1,0) #Binarizing the variables

X = as.data.frame(X)
X$x6.day.hrs = data$x6.day.hrs #Recovering the x6 and x7 columns from the original dataset
X$x7.week.hrs = data$x7.week.hrs

hours_week = 24*7
sleep_time = 8 * 7 #Assuming 8 hours as the mean of hours sleeping
work_time = 8 * 5 #Assuming full time jobs of 8 hours per day with free weekends
eating_time = 2 * 7 #Assuming 2 hours a day spent having lunch, dinner, etc.

free_time = hours_week - sleep_time - work_time - eating_time #58 hours a week of free time

print(paste0("Average free time per week: ",free_time))
## [1] "Average free time per week: 58"

Given that on average we have 58 hours of free time per week (young people may have more), and given the previously shown density plots of hours per week (most of the density lies between 0 and 25), I have decided to use 30 hours a week dedicated to watching series as the threshold: x7 is set to 1 at or above it, and 0 otherwise.

Dividing these 30 hours by 7 gives around 4.28 hours a day, so I will input a 1 if the respondent answered 4.5 hours or more and 0 otherwise.

X$x6.day.hrs = ifelse(X$x6.day.hrs >= 4.5,1,0)
X$x7.week.hrs = ifelse(X$x7.week.hrs >= 30,1,0)

Bayesian EM algorithm

In order to implement this LCA, we are going to use the BayesLCA package.

library(BayesLCA) 
## Loading required package: e1071
## Loading required package: coda
set.seed(100)

First we are going to apply the algorithm with the default parameters (delta = alpha = beta = 1, restarts = 5, iterations = 500) and K = 3 (in this library, K is named G):

fit.EM=blca(X, G=3, method = "em") 
## Restart number 1, logpost = -1726.69... 
## New maximum found... Restart number 2, logpost = -1726.61... 
## Restart number 3, logpost = -1726.7... 
## Restart number 4, logpost = -1726.7... 
## Restart number 5, logpost = -1726.68...

As the EM algorithm is liable to converge to a local maximum or a saddle point, we increase the restarts from the default 5 to 15:

fit.EM=blca(X, G=3, method = "em", restarts = 15) 
## Restart number 1, logpost = -1726.62... 
## Restart number 2, logpost = -1726.69... 
## Restart number 3, logpost = -1726.69... 
## Restart number 4, logpost = -1743.81... 
## New maximum found... Restart number 5, logpost = -1726.62... 
## Restart number 6, logpost = -1744.18... 
## Restart number 7, logpost = -1729.76... 
## Restart number 8, logpost = -1726.68... 
## Restart number 9, logpost = -1744.18... 
## Restart number 10, logpost = -1726.62... 
## Restart number 11, logpost = -1726.69... 
## Restart number 12, logpost = -1726.7... 
## New maximum found... Restart number 13, logpost = -1726.62... 
## Restart number 14, logpost = -1726.69... 
## Restart number 15, logpost = -1726.7...

We can check the MAP estimators of the parameters of the model:

print(fit.EM)
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.874           0.993     0.792     0.100         0.806
## Group 2     0.938           0.933     0.818     0.721         0.346
## Group 3     0.364           0.311     0.079     0.145         0.480
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.421      0.232       0.109     0.440              0.459
## Group 2           0.739      0.368       0.276     0.681              0.627
## Group 3           0.150      0.000       0.000     0.102              0.202
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.041          0.147         0.946            1.000    0.166
## Group 2      0.082          0.316         1.000            1.000    0.632
## Group 3      0.091          0.093         0.567            0.923    0.140
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.082        0.247        0.075      0.021
## Group 2             0.668        0.668        0.159      0.266
## Group 3             0.000        0.110        0.105      0.026
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.473   0.335   0.193
## Warning: Posterior standard deviations not returned.

Also, we can check the prior hyperparameters:

summary(fit.EM)
## __________________
## 
## Bayes-LCA
## Diagnostic Summary
## __________________
## 
## Hyper-Parameters: 
## 
##  Item Probabilities:
## 
##  alpha: 
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1         1               1         1         1             1
## Group 2         1               1         1         1             1
## Group 3         1               1         1         1             1
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1               1          1           1         1                  1
## Group 2               1          1           1         1                  1
## Group 3               1          1           1         1                  1
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1          1              1             1                1        1
## Group 2          1              1             1                1        1
## Group 3          1              1             1                1        1
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1                 1            1            1          1
## Group 2                 1            1            1          1
## Group 3                 1            1            1          1
## 
##  beta: 
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1         1               1         1         1             1
## Group 2         1               1         1         1             1
## Group 3         1               1         1         1             1
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1               1          1           1         1                  1
## Group 2               1          1           1         1                  1
## Group 3               1          1           1         1                  1
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1          1              1             1                1        1
## Group 2          1              1             1                1        1
## Group 3          1              1             1                1        1
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1                 1            1            1          1
## Group 2                 1            1            1          1
## Group 3                 1            1            1          1
## 
##  Class Probabilities:
## 
##  delta: 
## Group 1 Group 2 Group 3 
##       1       1       1 
## __________________
## 
## Method: EM algorithm  
## 
##  Number of iterations: 19 
## 
##  Log-Posterior Increase at Convergence: 0.001168029 
## 
##  Log-Posterior: -1726.615 
## 
##  AIC: -3569.844 
## 
##  BIC: -3765.032
par(mfrow=c(1,1))
plot(fit.EM)

Now we are going to try different priors:

fit.EM.prior2=blca.em(X, G=3,restarts=15,alpha=2,beta=2, verbose = FALSE) 
print(fit.EM.prior2)
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.870           0.977     0.779     0.124         0.786
## Group 2     0.924           0.919     0.812     0.735         0.340
## Group 3     0.356           0.311     0.095     0.169         0.466
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.420      0.230       0.113     0.452              0.460
## Group 2           0.756      0.387       0.297     0.670              0.632
## Group 3           0.168      0.030       0.029     0.124              0.213
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.054          0.155         0.942            0.990    0.184
## Group 2      0.093          0.332         0.981            0.985    0.643
## Group 3      0.112          0.113         0.553            0.899    0.161
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.110        0.259        0.087      0.031
## Group 2             0.671        0.680        0.169      0.289
## Group 3             0.031        0.133        0.127      0.053
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.503   0.310   0.187
## Warning: Posterior standard deviations not returned.
fit.EM.prior0=blca.em(X, G=3,restarts=15,alpha=0.001,beta=0.001,verbose=FALSE) 
print(fit.EM.prior0)
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.761           0.828     0.617     0.164         0.720
## Group 2     1.000           1.000     0.882     0.718         0.362
## Group 3     0.000           0.000     0.000     0.000         0.000
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.365      0.147       0.076     0.384              0.388
## Group 2           0.808      0.454       0.303     0.690              0.682
## Group 3           0.000      0.000       0.000     0.000              0.155
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.089          0.156         0.889            0.985    0.218
## Group 2      0.000          0.288         1.000            1.000    0.587
## Group 3      0.000          0.000         0.150            1.000    0.000
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.098        0.261        0.099      0.040
## Group 2             0.680        0.647        0.120      0.246
## Group 3             0.000        0.000        0.000      0.000
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.676   0.285   0.039
## Warning: Posterior standard deviations not returned.
par(mfrow=c(2,2))
plot(fit.EM,main="Alpha=1, Beta=1")
plot(fit.EM.prior2, main="Alpha=2, Beta=2")
plot(fit.EM.prior0,main="Alpha=0.001, Beta=0.001")

We can see that the fit with the alpha = 2, beta = 2 prior is practically identical to the one with the uniform prior; on the other hand, the nearly “improper” non-informative prior (alpha = beta = 0.001) is completely off: it pushes many item probabilities to exactly 0 or 1 and almost empties Group 3.

Now, we can try different K values and use the BIC criterion to see which mixture size is better:

fit.EM.k.2=blca(X, 2, method = "em", restarts=15, verbose=FALSE)
fit.EM.k.4=blca(X, 4, method = "em", restarts=15, verbose=FALSE)

print(fit.EM.k.2) 
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.939           0.966     0.846     0.425         0.561
## Group 2     0.521           0.601     0.309     0.107         0.643
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.618      0.353       0.218     0.604              0.600
## Group 2           0.197      0.000       0.000     0.167              0.204
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.066          0.259         0.985            1.000    0.429
## Group 2      0.061          0.066         0.709            0.956    0.099
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.398        0.468        0.117      0.149
## Group 2             0.000        0.154        0.094      0.016
## 
## Membership Probabilities:
##  
## Group 1 Group 2 
##    0.66    0.34
## Warning: Posterior standard deviations not returned.
print(fit.EM.k.4) 
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.933           1.000     0.859     0.137         0.784
## Group 2     0.423           0.461     0.166     0.135         0.579
## Group 3     0.924           0.888     0.855     0.723         0.094
## Group 4     0.905           1.000     0.732     0.677         0.658
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.483      0.203       0.058     0.489              0.494
## Group 2           0.147      0.000       0.000     0.156              0.203
## Group 3           0.676      0.096       0.088     0.706              0.477
## Group 4           0.796      0.945       0.752     0.580              0.851
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.061          0.154         1.000            1.000    0.243
## Group 2      0.074          0.078         0.613            0.942    0.120
## Group 3      0.109          0.516         1.000            1.000    0.673
## Group 4      0.000          0.102         0.928            1.000    0.441
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.158        0.220        0.076      0.014
## Group 2             0.000        0.161        0.088      0.022
## Group 3             0.676        0.828        0.296      0.196
## Group 4             0.530        0.556        0.000      0.417
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 Group 4 
##   0.426   0.256   0.182   0.137
## Warning: Posterior standard deviations not returned.
par(mfrow=c(1,2))
plot(fit.EM.k.2,main="K = 2")
plot(fit.EM.k.4, main="K = 4")

-fit.EM$BIC
## [1] 3765.032
-fit.EM.k.2$BIC
## [1] 3778.596
-fit.EM.k.4$BIC
## [1] 3788.674
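The BIC penalty depends on the number of free parameters, which for this model is K·M item probabilities plus K − 1 free class weights. A quick helper to count them (my own convenience function, not part of BayesLCA):

```r
# Free parameters of a K-class LCA with M binary items:
# K*M item probabilities plus (K - 1) independent class weights
# (the weights sum to 1, so one of them is determined by the rest).
p_lca <- function(K, M) K * M + (K - 1)

p_lca(2, 19)  # K = 2 -> 39
p_lca(3, 19)  # K = 3 -> 59
p_lca(4, 19)  # K = 4 -> 79
```

With M = 19 items, each extra class costs 20 additional parameters, which is what the BIC penalty has to be weighed against.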

Note that this library calculates the BIC so that higher values are better,

$$\text{BIC}_{\texttt{BayesLCA}} = 2 \log \hat{L} - p \log N$$

and not as the classical

$$\text{BIC} = -2 \log \hat{L} + p \log N$$

where $\hat{L}$ is the model likelihood at the mode, $p$ the number of free parameters and $N$ the number of respondents; this is why we negate the stored values above, so that lower is better again. According to this criterion, the lowest (negated) value is the one for K = 3. Finally, given the estimated model parameters, we can obtain the predictive class probabilities for a new observation:

Xnew = X[51,]
Zscore(Xnew,fit=fit.EM)
##                          [,1]      [,2]        [,3]
## 1100100011001100101 0.8150426 0.1768332 0.008124268
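The probabilities returned by Zscore are just Bayes' rule applied to the estimated parameters. A self-contained sketch with hypothetical parameters (K = 2 classes, M = 3 items; not our fitted values):

```r
# Pr(z = k | x) is proportional to  w_k * prod_m theta_km^x_m (1-theta_km)^(1-x_m)
w     <- c(0.6, 0.4)                 # hypothetical class weights
theta <- rbind(c(0.9, 0.8, 0.1),     # hypothetical item probabilities
               c(0.2, 0.3, 0.7))
x_new <- c(1, 1, 0)                  # a new binary response vector

lik  <- apply(theta, 1, function(th) prod(th^x_new * (1 - th)^(1 - x_new)))
post <- w * lik / sum(w * lik)       # normalise over the K classes
round(post, 3)                       # -> 0.982 0.018
```

The respondent is assigned (softly) to class 1, whose item probabilities match the observed answers far better.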

Gibbs sampling

With the previous algorithm we could only obtain the MAP estimates; with Gibbs sampling we also want to obtain the whole posterior distribution.

Gibbs sampling is a particular MCMC method for the case where the full conditional posterior distributions are known in closed form.

In order to obtain a sample from the joint distribution $f(\mathbf{w}, \boldsymbol{\theta}, \mathbf{z} \mid \mathbf{x})$, we can sample iteratively from the conditional posterior distributions:

  1. Sample $\boldsymbol{\theta}^{(t+1)} \sim f(\boldsymbol{\theta} \mid \mathbf{x}, \mathbf{w}^{(t)}, \mathbf{z}^{(t)})$
  2. Sample $\mathbf{w}^{(t+1)} \sim f(\mathbf{w} \mid \mathbf{x}, \boldsymbol{\theta}^{(t+1)}, \mathbf{z}^{(t)})$
  3. Sample $\mathbf{z}^{(t+1)} \sim \Pr(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t+1)}, \mathbf{w}^{(t+1)})$

This Gibbs sampling algorithm is easy to implement since the conditional posterior distributions have a closed form; in this case we say that we have chosen semiconjugate priors.
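To see why the semiconjugate choice makes the sampler so simple, here is one full Gibbs sweep for this model written in plain R, on toy data with unit (flat) priors. This is only a sketch of the three steps above, not the BayesLCA implementation:

```r
# One Gibbs sweep for the LCA model on toy binary data.
set.seed(2)
N <- 20; M <- 4; K <- 2
x <- matrix(rbinom(N * M, 1, 0.5), N, M)   # toy 0/1 responses
z <- sample(1:K, N, replace = TRUE)        # current class assignments
alpha <- beta <- delta <- 1                # flat Beta / Dirichlet priors

# Step 1: theta_km | x, z ~ Beta(alpha + successes, beta + failures)
theta <- matrix(NA_real_, K, M)
for (k in 1:K) {
  nk   <- sum(z == k)
  succ <- colSums(x[z == k, , drop = FALSE])
  theta[k, ] <- rbeta(M, alpha + succ, beta + nk - succ)
}

# Step 2: w | z ~ Dirichlet(delta + class counts), via normalised Gammas
g <- rgamma(K, shape = delta + tabulate(z, nbins = K))
w <- g / sum(g)

# Step 3: z_i | x, theta, w ~ Categorical, proportional to w_k * likelihood
for (i in 1:N) {
  p <- w * apply(theta, 1, function(th) prod(th^x[i, ] * (1 - th)^(1 - x[i, ])))
  z[i] <- sample(1:K, 1, prob = p / sum(p))
}
```

Repeating this sweep many times (after a burn-in) yields draws from the joint posterior; BayesLCA does essentially this, more efficiently.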

Now, we can apply the Gibbs sampling algorithm for the series data. We initially use k=3 and default priors:

fit.GS=blca(X, 3, method = "gibbs") 
## Initialising sampler...starting burn-in.
## Burn-in completed...
## 1000 of 5000 samples completed...
## 2000 of 5000 samples completed...
## 3000 of 5000 samples completed...
## 4000 of 5000 samples completed...
## 5000 of 5000 samples completed...
## Sampling run completed.
print(fit.GS)
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.870           0.969     0.775     0.146         0.775
## Group 2     0.920           0.917     0.814     0.729         0.321
## Group 3     0.358           0.311     0.104     0.169         0.468
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.429      0.232       0.113     0.457              0.463
## Group 2           0.767      0.392       0.305     0.666              0.635
## Group 3           0.163      0.033       0.031     0.120              0.218
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.055          0.159         0.943            0.991    0.197
## Group 2      0.095          0.344         0.979            0.983    0.657
## Group 3      0.110          0.108         0.546            0.896    0.157
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.126        0.266        0.088      0.032
## Group 2             0.672        0.689        0.176      0.308
## Group 3             0.033        0.139        0.129      0.056
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.514   0.295   0.190 
## 
## Posterior Standard Deviation Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.050           0.025     0.062     0.060         0.050
## Group 2     0.039           0.040     0.057     0.074         0.093
## Group 3     0.090           0.129     0.066     0.066         0.104
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.064      0.053       0.038     0.064              0.063
## Group 2           0.072      0.071       0.068     0.069              0.070
## Group 3           0.068      0.031       0.031     0.067              0.072
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.025          0.039         0.028            0.010    0.051
## Group 2      0.041          0.075         0.021            0.017    0.086
## Group 3      0.053          0.055         0.102            0.053    0.065
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.055        0.054        0.031      0.021
## Group 2             0.080        0.079        0.056      0.080
## Group 3             0.031        0.062        0.058      0.040
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.058   0.063   0.048

Before looking at some result plots, we are going to check whether the algorithm is performing well by running convergence diagnostics.

par(mfrow = c(3, 3)) 
plot(fit.GS, which = 5)

raftery.diag(as.mcmc(fit.GS))
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
##                                                      
##                Burn-in  Total Lower bound  Dependence
##                (M)      (N)   (Nmin)       factor (I)
##  ClassProb 1   8        10734 3746         2.870     
##  ClassProb 2   30       35040 3746         9.350     
##  ClassProb 3   8        8420  3746         2.250     
##  ItemProb 1 1  4        5211  3746         1.390     
##  ItemProb 1 2  12       19137 3746         5.110     
##  ItemProb 1 3  8        10116 3746         2.700     
##  ItemProb 1 4  6        8930  3746         2.380     
##  ItemProb 1 5  3        4198  3746         1.120     
##  ItemProb 1 6  3        4267  3746         1.140     
##  ItemProb 1 7  3        4267  3746         1.140     
##  ItemProb 1 8  5        5673  3746         1.510     
##  ItemProb 1 9  4        4994  3746         1.330     
##  ItemProb 1 10 3        4484  3746         1.200     
##  ItemProb 1 11 4        5211  3746         1.390     
##  ItemProb 1 12 3        4484  3746         1.200     
##  ItemProb 1 13 3        4410  3746         1.180     
##  ItemProb 1 14 2        3741  3746         0.999     
##  ItemProb 1 15 4        4955  3746         1.320     
##  ItemProb 1 16 8        9056  3746         2.420     
##  ItemProb 1 17 4        4728  3746         1.260     
##  ItemProb 1 18 4        4713  3746         1.260     
##  ItemProb 1 19 4        5124  3746         1.370     
##  ItemProb 2 1  3        4129  3746         1.100     
##  ItemProb 2 2  3        4410  3746         1.180     
##  ItemProb 2 3  3        4198  3746         1.120     
##  ItemProb 2 4  3        4484  3746         1.200     
##  ItemProb 2 5  22       21906 3746         5.850     
##  ItemProb 2 6  3        4410  3746         1.180     
##  ItemProb 2 7  3        4267  3746         1.140     
##  ItemProb 2 8  2        3741  3746         0.999     
##  ItemProb 2 9  3        4267  3746         1.140     
##  ItemProb 2 10 3        4129  3746         1.100     
##  ItemProb 2 11 3        4129  3746         1.100     
##  ItemProb 2 12 3        4062  3746         1.080     
##  ItemProb 2 13 2        3995  3746         1.070     
##  ItemProb 2 14 3        4062  3746         1.080     
##  ItemProb 2 15 3        4267  3746         1.140     
##  ItemProb 2 16 3        4198  3746         1.120     
##  ItemProb 2 17 3        4062  3746         1.080     
##  ItemProb 2 18 3        4484  3746         1.200     
##  ItemProb 2 19 2        3995  3746         1.070     
##  ItemProb 3 1  4        5124  3746         1.370     
##  ItemProb 3 2  12       16167 3746         4.320     
##  ItemProb 3 3  6        10102 3746         2.700     
##  ItemProb 3 4  3        4097  3746         1.090     
##  ItemProb 3 5  6        6406  3746         1.710     
##  ItemProb 3 6  10       12624 3746         3.370     
##  ItemProb 3 7  2        3930  3746         1.050     
##  ItemProb 3 8  2        3866  3746         1.030     
##  ItemProb 3 9  10       10206 3746         2.720     
##  ItemProb 3 10 2        3995  3746         1.070     
##  ItemProb 3 11 2        3995  3746         1.070     
##  ItemProb 3 12 4        4728  3746         1.260     
##  ItemProb 3 13 8        9152  3746         2.440     
##  ItemProb 3 14 3        4198  3746         1.120     
##  ItemProb 3 15 2        3995  3746         1.070     
##  ItemProb 3 16 2        3561  3746         0.951     
##  ItemProb 3 17 4        4792  3746         1.280     
##  ItemProb 3 18 2        3866  3746         1.030     
##  ItemProb 3 19 3        4267  3746         1.140

The diagnostics show quick convergence (low burn-in values) but not-so-good mixing: several parameters have fairly large dependence factors, which indicates autocorrelation in the chains.
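The dependence factor essentially measures how much autocorrelation inflates the number of draws needed. A toy illustration with coda (which BayesLCA already loads), using an artificial AR(1) chain unrelated to our model:

```r
library(coda)
set.seed(3)
n  <- 5000
ar <- numeric(n)
for (t in 2:n) ar[t] <- 0.9 * ar[t - 1] + rnorm(1)   # autocorrelated chain
iid <- rnorm(n)                                      # independent draws
es <- effectiveSize(as.mcmc(cbind(ar = ar, iid = iid)))
es   # effective sample size of the AR(1) chain is far below n
```

Thinning (keeping only every j-th draw) or simply running the sampler longer are the usual remedies, which is what we do next.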

With this analysis in mind, we can rerun the model with better sampler settings (a longer run with thinning):

fit.GS.tuned=blca(X, 3, method = "gibbs", burn.in = 50, thin = 1/5, iter = 20000)
## Initialising sampler...starting burn-in.
## Burn-in completed...
## 1000 of 20000 samples completed...
## 2000 of 20000 samples completed...
## 3000 of 20000 samples completed...
## 4000 of 20000 samples completed...
## 5000 of 20000 samples completed...
## 6000 of 20000 samples completed...
## 7000 of 20000 samples completed...
## 8000 of 20000 samples completed...
## 9000 of 20000 samples completed...
## 10000 of 20000 samples completed...
## 11000 of 20000 samples completed...
## 12000 of 20000 samples completed...
## 13000 of 20000 samples completed...
## 14000 of 20000 samples completed...
## 15000 of 20000 samples completed...
## 16000 of 20000 samples completed...
## 17000 of 20000 samples completed...
## 18000 of 20000 samples completed...
## 19000 of 20000 samples completed...
## 20000 of 20000 samples completed...
## Sampling run completed.
print(fit.GS.tuned)
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.867           0.969     0.772     0.139         0.777
## Group 2     0.921           0.919     0.815     0.727         0.329
## Group 3     0.359           0.302     0.102     0.175         0.463
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.424      0.228       0.111     0.453              0.459
## Group 2           0.765      0.394       0.304     0.664              0.634
## Group 3           0.160      0.033       0.033     0.119              0.214
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.055          0.157         0.942            0.990    0.192
## Group 2      0.094          0.340         0.979            0.984    0.652
## Group 3      0.114          0.110         0.543            0.892    0.160
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.123        0.262        0.087      0.030
## Group 2             0.667        0.682        0.173      0.303
## Group 3             0.033        0.138        0.132      0.058
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.511   0.303   0.186 
## 
## Posterior Standard Deviation Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.049           0.026     0.063     0.057         0.051
## Group 2     0.038           0.038     0.057     0.073         0.083
## Group 3     0.091           0.127     0.066     0.068         0.104
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.062      0.051       0.038     0.062              0.060
## Group 2           0.072      0.070       0.066     0.067              0.070
## Group 3           0.068      0.032       0.031     0.068              0.071
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.026          0.039         0.028            0.009    0.051
## Group 2      0.041          0.071         0.020            0.017    0.082
## Group 3      0.056          0.055         0.103            0.055    0.066
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.053        0.052        0.031      0.020
## Group 2             0.076        0.074        0.054      0.077
## Group 3             0.032        0.064        0.061      0.041
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.057   0.059   0.047

We can now look at the plots of the posterior densities of the model parameters: the item probabilities conditional on class membership, and the class probabilities:

par(mfrow=c(3,3))
plot(fit.GS.tuned,which=3)

par(mfrow=c(1,1))
plot(fit.GS.tuned,which=4)

par(mfrow=c(3,3))
plot(fit.GS.tuned,which=5)

We could also try different priors, but as the computational cost is too high, I leave those runs aside. For the number of groups, we fit K = 2 and K = 4 via Gibbs sampling and look only at the BIC criterion, in order to confirm that K = 3 gives the best mixture:

fit.GS.k.2=blca(X, 2, method = "gibbs", iter = 10000)
fit.GS.k.4=blca(X, 4, method = "gibbs", iter = 20000)

## Initialising sampler...starting burn-in.
## Burn-in completed...
## 1000 of 10000 samples completed...
## 2000 of 10000 samples completed...
## 3000 of 10000 samples completed...
## 4000 of 10000 samples completed...
## 5000 of 10000 samples completed...
## 6000 of 10000 samples completed...
## 7000 of 10000 samples completed...
## 8000 of 10000 samples completed...
## 9000 of 10000 samples completed...
## 10000 of 10000 samples completed...
## Sampling run completed.
## Initialising sampler...starting burn-in.
## Burn-in completed...
## 1000 of 20000 samples completed...
## 2000 of 20000 samples completed...
## 3000 of 20000 samples completed...
## 4000 of 20000 samples completed...
## 5000 of 20000 samples completed...
## 6000 of 20000 samples completed...
## 7000 of 20000 samples completed...
## 8000 of 20000 samples completed...
## 9000 of 20000 samples completed...
## 10000 of 20000 samples completed...
## 11000 of 20000 samples completed...
## 12000 of 20000 samples completed...
## 13000 of 20000 samples completed...
## 14000 of 20000 samples completed...
## 15000 of 20000 samples completed...
## 16000 of 20000 samples completed...
## 17000 of 20000 samples completed...
## 18000 of 20000 samples completed...
## 19000 of 20000 samples completed...
## 20000 of 20000 samples completed...
## Sampling run completed.
-fit.GS.tuned$BIC
## [1] 3810.303
-fit.GS.k.2$BIC
## [1] 3812.406
-fit.GS.k.4$BIC
## [1] 3847.216

In this case K = 2 and K = 3 are pretty close: K = 2 wins in some runs and K = 3 in others. We will nevertheless keep working with K = 3, so as to obtain the three intended groups: addicted, non-addicted and indifferent to series.
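The selection rule above can be sketched directly from the (negated) BIC values reported by the sampler; the numbers below are copied from the output, and smaller is better here:

```r
# Negated BIC values from the three Gibbs-sampling fits above
bic <- c(K2 = 3812.406, K3 = 3810.303, K4 = 3847.216)
names(which.min(bic))  # "K3"
```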

Variational Bayes

The idea consists in approximating the posterior distribution f(ω, θ, z | x) with a variational distribution q(ω, θ, z) which assumes independence among blocks of parameters:

q(ω, θ, z) = q₁(ω | γ) q₂(θ | ζ) q₃(z | φ)

where (γ, ζ, φ) are the variational parameters.

The VB approach looks for the distributions q_j that minimize the Kullback-Leibler divergence between the variational approximation and the posterior.

In mixture models, it can be shown that the optimal form of each q_j is the same as that of the corresponding conditional posterior distribution:

ω | γ ∼ Dirichlet(γ₁, …, γ_K)

θ_km | ζ ∼ Beta(ζ_km,1, ζ_km,2)

z_i | φ ∼ Multinomial(1; φ_i1, …, φ_iK)

The variational parameters are updated iteratively until the KL divergence is minimized.
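As a minimal illustration (not part of the original analysis), the quantity being minimized, the Kullback-Leibler divergence, can be computed directly for two discrete distributions:

```r
# KL(p || q) for discrete distributions: a toy sketch of the VB objective
kl_div <- function(p, q) sum(p * log(p / q))

p <- c(0.5, 0.3, 0.2)
q <- c(0.4, 0.4, 0.2)
kl_div(p, p)  # 0: the divergence vanishes when the distributions coincide
kl_div(p, q)  # positive otherwise
```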

Now, we apply the Variational Bayes (VB) algorithm for the series data:

fit.VB=blca(X, 3, method = "vb") 
## Restart number 1, logpost = -1863.01...
print(fit.VB)
## 
## MAP Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.859           0.985     0.761     0.109         0.790
## Group 2     0.940           0.931     0.823     0.738         0.341
## Group 3     0.334           0.227     0.057     0.160         0.441
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.402      0.217       0.101     0.441              0.447
## Group 2           0.762      0.380       0.287     0.678              0.635
## Group 3           0.156      0.001       0.001     0.077              0.198
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.044          0.145         0.945            1.000    0.168
## Group 2      0.081          0.322         0.998            1.000    0.644
## Group 3      0.096          0.096         0.521            0.911    0.154
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.093        0.251        0.077      0.021
## Group 2             0.674        0.676        0.157      0.277
## Group 3             0.001        0.104        0.118      0.030
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.516   0.317   0.167 
## 
## Posterior Standard Deviation Estimates:
##  
## 
## Item Probabilities:
##  
##         x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## Group 1     0.034           0.015     0.041     0.031         0.040
## Group 2     0.032           0.033     0.048     0.054         0.058
## Group 3     0.078           0.071     0.045     0.063         0.082
##         x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## Group 1           0.047      0.040       0.030     0.048              0.048
## Group 2           0.053      0.059       0.056     0.057              0.059
## Group 3           0.063      0.028       0.028     0.050              0.068
##         x10.cinema x11.smartphone x12.recommend x13.be.recommend x14.late
## Group 1      0.022          0.035         0.024            0.009    0.037
## Group 2      0.036          0.057         0.016            0.015    0.059
## Group 3      0.053          0.053         0.083            0.052    0.063
##         x15.procrastinate x16.non.stop x17.too.much x18.addict
## Group 1             0.029        0.042        0.027      0.017
## Group 2             0.057        0.057        0.046      0.055
## Group 3             0.028        0.055        0.057      0.038
## 
## Membership Probabilities:
##  
## Group 1 Group 2 Group 3 
##   0.035   0.032   0.026

Observe that the VB method is much faster than Gibbs sampling, and it also provides posterior standard deviation estimates.

fit.VB$itemprob 
##      x0.Recent x1.Subscription  x2.Shared x3.Pirate x4.in.company
## [1,] 0.8588870       0.9850253 0.76086159 0.1085302     0.7895723
## [2,] 0.9397215       0.9312647 0.82346703 0.7380368     0.3406571
## [3,] 0.3339084       0.2268142 0.05678021 0.1598690     0.4414827
##      x5.meet.friends  x6.day.hrs  x7.week.hrs  x8.trends x9.series.vs.films
## [1,]       0.4021508 0.216619346 0.1012336036 0.44082576          0.4471899
## [2,]       0.7619196 0.380476637 0.2873263199 0.67821731          0.6354185
## [3,]       0.1558255 0.001013152 0.0009432598 0.07663585          0.1978054
##      x10.cinema x11.smartphone x12.recommend x13.be.recommend  x14.late
## [1,] 0.04402172     0.14517401     0.9449849        0.9999985 0.1683718
## [2,] 0.08082083     0.32213651     0.9978032        1.0000000 0.6437150
## [3,] 0.09596543     0.09564748     0.5211017        0.9109057 0.1541671
##      x15.procrastinate x16.non.stop x17.too.much x18.addict
## [1,]       0.093463816    0.2509851   0.07663344 0.02110870
## [2,]       0.674124467    0.6760970   0.15677631 0.27729064
## [3,]       0.001478075    0.1040733   0.11771631 0.03049472
fit.VB$classprob 
## [1] 0.5160120 0.3173026 0.1666854
fit.VB$itemprob.sd
##       x0.Recent x1.Subscription  x2.Shared  x3.Pirate x4.in.company
## [1,] 0.03427877      0.01481141 0.04147807 0.03091188    0.03973103
## [2,] 0.03187497      0.03345522 0.04753483 0.05414695    0.05805332
## [3,] 0.07840403      0.07073988 0.04521453 0.06329694    0.08206263
##      x5.meet.friends x6.day.hrs x7.week.hrs  x8.trends x9.series.vs.films
## [1,]      0.04738550 0.04012982  0.03006293 0.04795733         0.04802399
## [2,]      0.05257888 0.05937866  0.05560678 0.05727905         0.05889846
## [3,]      0.06276362 0.02770660  0.02767599 0.04962291         0.06781280
##      x10.cinema x11.smartphone x12.recommend x13.be.recommend   x14.late
## [1,] 0.02155839     0.03465624    0.02353048      0.009325663 0.03666081
## [2,] 0.03554304     0.05729446    0.01589993      0.014902511 0.05862226
## [3,] 0.05339506     0.05333654    0.08250247      0.052105426 0.06254173
##      x15.procrastinate x16.non.stop x17.too.much x18.addict
## [1,]        0.02911369   0.04212669   0.02687401 0.01650633
## [2,]        0.05745521   0.05737089   0.04555160 0.05505337
## [3,]        0.02790921   0.05485095   0.05715416 0.03822850
fit.VB$classprob.sd
## [1] 0.03482425 0.03243420 0.02611784

MAP estimates are close to those obtained with the Gibbs sampling algorithm:

fit.GS$itemprob 
##      x0.Recent x1.Subscription x2.Shared x3.Pirate x4.in.company
## [1,] 0.8699643       0.9694923 0.7750651 0.1458995     0.7753410
## [2,] 0.9198235       0.9166612 0.8137024 0.7288890     0.3207982
## [3,] 0.3577859       0.3114059 0.1042425 0.1692080     0.4681448
##      x5.meet.friends x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films
## [1,]       0.4286518 0.23180909  0.11325459 0.4572638          0.4626439
## [2,]       0.7674769 0.39189486  0.30536077 0.6662314          0.6347646
## [3,]       0.1625938 0.03321466  0.03126934 0.1200338          0.2182409
##      x10.cinema x11.smartphone x12.recommend x13.be.recommend  x14.late
## [1,] 0.05460469      0.1587203     0.9430629        0.9905514 0.1965169
## [2,] 0.09475073      0.3438776     0.9785950        0.9830372 0.6569047
## [3,] 0.11047678      0.1077013     0.5455975        0.8956746 0.1571282
##      x15.procrastinate x16.non.stop x17.too.much x18.addict
## [1,]        0.12636189    0.2655272   0.08843145 0.03171579
## [2,]        0.67234303    0.6890862   0.17560979 0.30795738
## [3,]        0.03340858    0.1389569   0.12938987 0.05592395
fit.VB$itemprob 
##      x0.Recent x1.Subscription  x2.Shared x3.Pirate x4.in.company
## [1,] 0.8588870       0.9850253 0.76086159 0.1085302     0.7895723
## [2,] 0.9397215       0.9312647 0.82346703 0.7380368     0.3406571
## [3,] 0.3339084       0.2268142 0.05678021 0.1598690     0.4414827
##      x5.meet.friends  x6.day.hrs  x7.week.hrs  x8.trends x9.series.vs.films
## [1,]       0.4021508 0.216619346 0.1012336036 0.44082576          0.4471899
## [2,]       0.7619196 0.380476637 0.2873263199 0.67821731          0.6354185
## [3,]       0.1558255 0.001013152 0.0009432598 0.07663585          0.1978054
##      x10.cinema x11.smartphone x12.recommend x13.be.recommend  x14.late
## [1,] 0.04402172     0.14517401     0.9449849        0.9999985 0.1683718
## [2,] 0.08082083     0.32213651     0.9978032        1.0000000 0.6437150
## [3,] 0.09596543     0.09564748     0.5211017        0.9109057 0.1541671
##      x15.procrastinate x16.non.stop x17.too.much x18.addict
## [1,]       0.093463816    0.2509851   0.07663344 0.02110870
## [2,]       0.674124467    0.6760970   0.15677631 0.27729064
## [3,]       0.001478075    0.1040733   0.11771631 0.03049472
fit.GS.tuned$classprob 
## [1] 0.5107913 0.3027477 0.1864610
fit.VB$classprob
## [1] 0.5160120 0.3173026 0.1666854

However, Gibbs sampling provides a better approximation of the posterior distributions. Observe that its posterior standard deviation estimates are larger than those obtained with the VB method; this is caused by the independence between parameter blocks enforced by VB, which typically underestimates the posterior uncertainty.

fit.GS.tuned$itemprob.sd
##       x0.Recent x1.Subscription  x2.Shared  x3.Pirate x4.in.company
## [1,] 0.04857874      0.02558162 0.06291761 0.05720351    0.05051660
## [2,] 0.03769376      0.03811155 0.05652233 0.07253795    0.08266426
## [3,] 0.09142739      0.12691456 0.06579413 0.06759151    0.10354081
##      x5.meet.friends x6.day.hrs x7.week.hrs  x8.trends x9.series.vs.films
## [1,]      0.06216102 0.05083708  0.03823216 0.06239750         0.05990135
## [2,]      0.07183806 0.06987098  0.06648555 0.06675373         0.06962007
## [3,]      0.06765567 0.03186953  0.03132139 0.06834100         0.07085235
##      x10.cinema x11.smartphone x12.recommend x13.be.recommend   x14.late
## [1,] 0.02556842     0.03933352    0.02792014      0.009396706 0.05081652
## [2,] 0.04124470     0.07069300    0.02040096      0.017307536 0.08170084
## [3,] 0.05607453     0.05549105    0.10311566      0.055415126 0.06599695
##      x15.procrastinate x16.non.stop x17.too.much x18.addict
## [1,]        0.05282257   0.05184842   0.03076281 0.01981604
## [2,]        0.07609116   0.07390731   0.05425701 0.07660064
## [3,]        0.03161159   0.06359193   0.06121339 0.04085680
fit.VB$itemprob.sd
##       x0.Recent x1.Subscription  x2.Shared  x3.Pirate x4.in.company
## [1,] 0.03427877      0.01481141 0.04147807 0.03091188    0.03973103
## [2,] 0.03187497      0.03345522 0.04753483 0.05414695    0.05805332
## [3,] 0.07840403      0.07073988 0.04521453 0.06329694    0.08206263
##      x5.meet.friends x6.day.hrs x7.week.hrs  x8.trends x9.series.vs.films
## [1,]      0.04738550 0.04012982  0.03006293 0.04795733         0.04802399
## [2,]      0.05257888 0.05937866  0.05560678 0.05727905         0.05889846
## [3,]      0.06276362 0.02770660  0.02767599 0.04962291         0.06781280
##      x10.cinema x11.smartphone x12.recommend x13.be.recommend   x14.late
## [1,] 0.02155839     0.03465624    0.02353048      0.009325663 0.03666081
## [2,] 0.03554304     0.05729446    0.01589993      0.014902511 0.05862226
## [3,] 0.05339506     0.05333654    0.08250247      0.052105426 0.06254173
##      x15.procrastinate x16.non.stop x17.too.much x18.addict
## [1,]        0.02911369   0.04212669   0.02687401 0.01650633
## [2,]        0.05745521   0.05737089   0.04555160 0.05505337
## [3,]        0.02790921   0.05485095   0.05715416 0.03822850
fit.GS.tuned$classprob.sd
## [1] 0.05724970 0.05892567 0.04728239
fit.VB$classprob.sd
## [1] 0.03482425 0.03243420 0.02611784
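A quick check on the class-probability standard deviations reported above (values copied from the output) confirms the point:

```r
# Posterior sd of the class probabilities, Gibbs sampling vs VB (from the output above)
sd.GS <- c(0.05724970, 0.05892567, 0.04728239)
sd.VB <- c(0.03482425, 0.03243420, 0.02611784)
all(sd.GS > sd.VB)  # TRUE: VB systematically understates the posterior spread
```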

We may also observe these differences in the plots of the density estimates for the model parameters, both for the item and the class probabilities:

par(mfrow = c(3,3))
plot(fit.GS.tuned,which=3)

par(mfrow = c(3,3))
plot(fit.VB,which=3)

K-means clustering

As an extra visualization we use the factoextra library to perform K-means clustering (closely related to the EM algorithm for mixture models). This library also gives a 2D representation of the clusters, since fviz_cluster applies PCA as a post-processing step to project the data onto two dimensions.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
set.seed(150)

km.res = kmeans(X, 3, nstart = 40)
fviz_cluster(km.res,X,ellipse = TRUE)

print(km.res)
## K-means clustering with 3 clusters of sizes 53, 85, 64
## 
## Cluster means:
##   x0.Recent x1.Subscription  x2.Shared x3.Pirate x4.in.company x5.meet.friends
## 1 0.4150943       0.4716981 0.05660377 0.1509434     0.5471698       0.1320755
## 2 0.9294118       0.9882353 0.91764706 0.1058824     0.8117647       0.4941176
## 3 0.9375000       0.9531250 0.82812500 0.7343750     0.3281250       0.7343750
##   x6.day.hrs x7.week.hrs x8.trends x9.series.vs.films x10.cinema x11.smartphone
## 1 0.03773585  0.03773585 0.1698113          0.2075472 0.07547170      0.0754717
## 2 0.25882353  0.11764706 0.4705882          0.4823529 0.04705882      0.1764706
## 3 0.35937500  0.26562500 0.6718750          0.6562500 0.07812500      0.3125000
##   x12.recommend x13.be.recommend  x14.late x15.procrastinate x16.non.stop
## 1     0.6792453        0.9433962 0.1132075        0.03773585    0.1320755
## 2     0.9411765        1.0000000 0.2000000        0.04705882    0.2235294
## 3     1.0000000        1.0000000 0.6406250        0.73437500    0.7343750
##   x17.too.much x18.addict
## 1   0.07547170 0.03773585
## 2   0.08235294 0.02352941
## 3   0.17187500 0.26562500
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 2 3 1 1 3 3 2 2 1 3 1 2 1 3 2 2 2 1 2 2 2 2 1 1 3 1 1 2 3
##  [38] 3 1 2 3 1 2 2 3 1 1 2 2 2 2 1 2 1 1 3 1 2 1 1 2 2 3 1 2 2 2 1 2 2 2 3 3 1
##  [75] 1 3 1 2 3 1 2 1 2 3 2 2 2 2 1 2 1 1 3 3 2 2 2 3 2 1 2 2 3 2 2 3 2 2 2 2 1
## [112] 2 2 1 2 1 3 1 3 2 3 2 1 1 3 2 3 2 1 2 3 1 1 2 3 3 3 1 3 2 1 3 2 2 3 3 2 3
## [149] 3 3 2 1 1 3 3 1 2 3 2 2 3 2 3 3 3 2 3 2 3 3 1 1 2 2 2 3 1 1 2 3 2 2 2 3 2
## [186] 3 3 2 1 2 3 1 1 3 2 1 2 2 2 2 2 3
## 
## Within cluster sum of squares by cluster:
## [1] 115.6604 184.2118 190.2188
##  (between_SS / total_SS =  23.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
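The mechanics of kmeans on 0/1 data can also be sketched on a small synthetic matrix (illustrative only; the real X comes from the survey):

```r
set.seed(150)
# Two well-separated groups of binary response vectors
toy <- rbind(matrix(rbinom(50 * 5, 1, prob = 0.9), ncol = 5),
             matrix(rbinom(50 * 5, 1, prob = 0.1), ncol = 5))
km.toy <- kmeans(toy, centers = 2, nstart = 40)
km.toy$size  # two clusters of roughly 50 rows each
```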

Comparing the different approaches results

Now that we have all the models implemented, we are going to plot the clustering results of each model broken down by age and gender.

cluster1 = data[which(km.res$cluster == 1),] 
cluster2 = data[which(km.res$cluster == 2),] 
cluster3 = data[which(km.res$cluster == 3),] 

par(mfrow=c(1,3))
pie(table(cluster1["Age"]),main="Cluster 1 K-means",radius=1)
pie(table(cluster2["Age"]),main="Cluster 2 K-means",radius=1)
pie(table(cluster3["Age"]),main="Cluster 3 K-means",radius=1)

par(mfrow=c(1,2))

a = dim(cluster1[which(cluster1$Age=="16-21"),])[1]
b = dim(cluster1[which(cluster1$Age=="22-30"),])[1]

c = dim(cluster2[which(cluster2$Age=="16-21"),])[1]
d = dim(cluster2[which(cluster2$Age=="22-30"),])[1]

e = dim(cluster3[which(cluster3$Age=="16-21"),])[1]
f = dim(cluster3[which(cluster3$Age=="22-30"),])[1]

total = a+b+c+d+e+f

cluster1_prop = (a+b)/total
cluster2_prop = (c+d)/total
cluster3_prop = (e+f)/total

pie(x=c(cluster1_prop,cluster2_prop,cluster3_prop),main="Ages 16-30 Cluster Prop K-means",radius=1,labels = c("C1","C2","C3"))

a = dim(cluster1[which(cluster1$Age=="31-49"),])[1]
b = dim(cluster1[which(cluster1$Age=="50-70"),])[1]

c = dim(cluster2[which(cluster2$Age=="31-49"),])[1]
d = dim(cluster2[which(cluster2$Age=="50-70"),])[1]

e = dim(cluster3[which(cluster3$Age=="31-49"),])[1]
f = dim(cluster3[which(cluster3$Age=="50-70"),])[1]

total = a+b+c+d+e+f

cluster1_prop = (a+b)/total
cluster2_prop = (c+d)/total
cluster3_prop = (e+f)/total

pie(x=c(cluster1_prop,cluster2_prop,cluster3_prop),main="Ages 31-70 Cluster Prop K-means",radius=1,labels = c("C1","C2","C3"))

par(mfrow=c(1,3))
pie(table(cluster1["Gender"]),main="Cluster 1 K-means",radius=1)
pie(table(cluster2["Gender"]),main="Cluster 2 K-means",radius=1)
pie(table(cluster3["Gender"]),main="Cluster 3 K-means",radius=1)

Note that the different algorithms need not order the clusters in the same way: cluster 1 in K-means could be cluster 3 in the Gibbs sampling fit. Even so, it is easy to match the corresponding classes across methods, since their profiles are roughly the same.
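Such a label permutation between two clusterings that encode the same partition can be spotted with a cross-tabulation (toy labels, not the real fits):

```r
labels_km <- c(1, 1, 2, 2, 3, 3)   # hypothetical K-means labels
labels_gs <- c(3, 3, 1, 1, 2, 2)   # same partition, permuted labels
table(labels_km, labels_gs)        # one non-zero cell per row and column
```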

ClusterProbEM = as.data.frame(Zscore(X,fit=fit.EM)) 
colnames(ClusterProbEM) = c(1,2,3)
ClusterNumbersEM = as.integer(colnames(ClusterProbEM)[apply(ClusterProbEM,1,which.max)])

ClusterProbGS = as.data.frame(Zscore(X,fit=fit.GS.tuned)) 
colnames(ClusterProbGS) = c(1,2,3)
ClusterNumbersGS = as.integer(colnames(ClusterProbGS)[apply(ClusterProbGS,1,which.max)])

ClusterProbVB = as.data.frame(Zscore(X,fit=fit.VB)) 
colnames(ClusterProbVB) = c(1,2,3)
ClusterNumbersVB = as.integer(colnames(ClusterProbVB)[apply(ClusterProbVB,1,which.max)])
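The assignment step used above reduces to taking, for each row of a membership-probability matrix, the column with the largest probability; a self-contained toy sketch:

```r
# Toy membership-probability matrix (rows: observations, cols: clusters)
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.1, 0.8), nrow = 2, byrow = TRUE)
apply(probs, 1, which.max)  # MAP cluster of each observation: 1 2
```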

## [1] "Male respondents: 72"
## [1] "Female respondents: 126"

As we can see, gender has little relation with cluster membership, while younger people tend to fall in the more addicted clusters. The results of the different models are also quite similar; the largest differences appear between the frequentist and the Bayesian approaches, but even those are small. Given the computational cost, the frequentist K-means and the VB model would be the methods of choice.

Conclusions